The following is a demonstration of how to load COVID-19 case data from a few different sources:
library(tidyverse)
library(mapview)
library(sf)
library(tigris)
options(
tigris_class = "sf"
)
We’ll start with the Nature dataset. You will need to clone the nCoV2019 repo to your own machine.
knitr::opts_knit$set(root.dir = "~/GitHub/nCoV2019") # change this path if necessary.
The repo conveniently puts the latest data in a folder called “latest_data” in a file called “latestdata.csv”. But the script below demonstrates how you can extract a filename you don’t necessarily know beforehand.
filename <-
list.files(path = "latest_data", full.names = TRUE)
covid_full <-
read_csv(filename) # This loads the data. You should see a large dataframe in your environment.
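As an aside, the list.files() trick above can be tested on its own with a throwaway directory. The directory and file names below are invented purely for illustration; note the demo also uses a pattern argument, which is handy when the folder contains more than just the CSV you want:

```r
# Toy demonstration of extracting a filename you don't know beforehand.
# Everything here lives in a temporary directory, so it won't touch your files.
demo_dir <- file.path(tempdir(), "latest_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(demo_dir, "latestdata.csv"), row.names = FALSE)

# full.names = TRUE returns the full path rather than the bare file name,
# so the result can be passed straight to a read function.
demo_file <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)
demo_df <- read.csv(demo_file)
nrow(demo_df) # 3 rows, matching the toy CSV we wrote above
```

If the folder ever held multiple CSVs, list.files() would return a character vector, so in real use you may want to check its length before reading.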
It’s as simple as that to get the data into RStudio to work with. You’re welcome to explore your own analyses, but for this demonstration we’ll aggregate total case counts by state and county and produce some basic maps.
Note that the dataset includes lat/long fields (use colnames(covid_full) to easily see field names in your console), as well as other geographic fields that could potentially be used to associate rows with specific geographies.
My preference is to use their lat/long fields to associate every row with a geolocated coordinate, and then determine membership in a geographic boundary using spatial joins. I consider this the more universal approach because it deals with hierarchies of geography directly (a point in a county is also in that county’s state), and it takes advantage of the most granular information that might be logged. But first we’ll need to check how clean these fields are.
covid_missing_coords <-
covid_full %>%
filter(is.na(latitude) | is.na(longitude))
covid_missing_coords %>% nrow()
covid_missing_coords %>% filter(country == "United States") %>% nrow()
covid_missing_coords %>% filter(province == "California") %>% nrow()
## [1] 1133
## [1] 51
## [1] 0
The chunk above creates a subset of rows in which lat or long is missing, reports the count, and then reports the counts specifically for the U.S. and for California. The missing data seems negligible for our purposes. Next we’ll filter out the rows with missing coordinates and convert the remaining dataframe into an sf object in which the points have been geolocated.
covid_full_w_coords <-
covid_full %>%
filter(!is.na(latitude) & !is.na(longitude)) %>%
st_as_sf(coords = c("longitude","latitude"), crs = st_crs(4326)) # Google "EPSG 4326". I assumed the data was provided in this standard coordinate system, and st_crs(4326) tells the function to read the lat/long information in that system.
# mapview(covid_full_w_coords$geometry)
# You can map at this point, but it's a lot of points and may take a while to render, so feel free to skip this. Notice I've used $ to grab just the geometry field; mapview would take even longer to render all the other fields of information.
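Before we get to st_join(), the underlying idea — deciding which polygon a point falls in from its coordinates alone — can be sketched with a toy bounding-box check in base R. The "states" below are made-up rectangles, not real boundaries; st_join() does the real version of this test against actual polygon geometries:

```r
# Two made-up rectangular "states" defined by lon/lat bounding boxes.
boxes <- data.frame(
  NAME = c("StateA", "StateB"),
  xmin = c(-125, -110), xmax = c(-110, -95),
  ymin = c(  30,   30), ymax = c(  45,   45)
)

# A point-in-box test: the crude, rectangular version of what
# st_join() does properly against real polygon geometries.
locate <- function(lon, lat, boxes) {
  hit <- lon >= boxes$xmin & lon < boxes$xmax &
         lat >= boxes$ymin & lat < boxes$ymax
  if (any(hit)) boxes$NAME[which(hit)[1]] else NA_character_
}

locate(-120, 35, boxes)  # "StateA"
locate(-100, 40, boxes)  # "StateB"
locate(   0,  0, boxes)  # NA -- outside both boxes
```

The point of doing this spatially rather than trusting the dataset's own "province"/"admin1" labels is that the coordinate test works the same way at any level of the geographic hierarchy.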
Next, we’ll map by state in the U.S. Note that the dataframe likely has good documentation of state under the fields “province” or “admin1”, but the script below uses st_join based on the point coordinates and Census state polygons.
states <-
states(cb = FALSE, progress_bar = FALSE) %>% # This comes from the tigris package. Google "tigris r" to see full functionality. Also note that in the first chunk we set a tigris option to automatically load polygons as sf objects.
st_transform(4326) # tigris by default is in a different coordinate system, so we have to transform to maintain consistency with our points.
covid_by_state <-
covid_full_w_coords %>% # Recall this already is an sf object with points
st_join(states) %>% # This joins to state polygons and will attach information from the states object to the covid object, row by row. But using st_join() in this direction means the data is still "points"
st_set_geometry(NULL) %>% # We won't need the point geometries anymore. Removing them here will speed up the next two steps.
group_by(NAME) %>% # The NAME field came from the states object. Remember this is maybe redundant with state names that were already in the covid dataset, but we're just being systematic here and not assuming the data is clean.
summarize(
Count = n() # This will collapse the data to just 50 rows for 50 states (6 other territories too), and preserve only state names and a count of rows (cases) of covid data for each state. If you want to retain other useful data as sums, averages, etc., you'd do it here.
) %>%
right_join(
states, # This now attaches our case counts to the states file which has the state polygons, so we can represent the counts on a country map. We're right joining just in case there's a state that doesn't have cases, but we still want to map that state as NA. right_join() preserves the size of the right side.
by = "NAME"
) %>%
st_as_sf() # The object in the pipeline is no longer an sf object (we dropped the geometry earlier), but the right_join() brought in a geometry field from states, so this step just gets the dataset to "recognize" that geometry field.
mapview(covid_by_state, zcol = "Count") # This is the easiest way to make interactive maps for a webpage, though it's not necessarily the best for customizing look.
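The right_join() behavior leaned on above — keeping every state even when it has zero recorded cases, so it still shows up on the map (as NA) — can be sketched with toy data using base R's merge(), whose all.y = TRUE argument is roughly equivalent to dplyr's right_join(). The state names and counts below are invented:

```r
# Toy case counts: only two of the three "states" appear in the data.
counts <- data.frame(NAME = c("StateA", "StateB"), Count = c(10, 3))
states_tbl <- data.frame(NAME = c("StateA", "StateB", "StateC"))

# all.y = TRUE keeps every row of the right-hand table (the states),
# filling Count with NA where no case rows matched -- the same reason
# the pipeline above right-joins onto the states object.
joined <- merge(counts, states_tbl, by = "NAME", all.y = TRUE)
joined
#     NAME Count
# 1 StateA    10
# 2 StateB     3
# 3 StateC    NA
```

Had we used an inner join instead, StateC would drop out entirely and render as a hole in the map rather than an NA-shaded polygon.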